Udacity Data Science Nanodegree: Capstone Project

Interpretable Machine Learning on Credit Risk Data


As part of the rewarding journey that has been the 2020 Data Science Nanodegree, we are tasked with delivering a project that attests to the skills we have built. This project focuses on credit risk and interpretable machine learning applications for credit professionals working in the field of data science.

The source of our data is the historical Kaggle competition "Give Me Some Credit". The challenge is to classify whether a client will default on their financial obligations within a time frame of 2 years. A welcome and sound challenge for professionals joining the field.

Dataset available at: https://www.kaggle.com/c/GiveMeSomeCredit/overview

Our main goal will be to apply MLI (Machine Learning Interpretability) techniques, mainly LIME, to the results of a full machine learning process. MLI techniques are paramount for delivering the record-by-feature granularity required by regulatory bodies across the globe. Furthermore, MLI techniques help to detect model bias on ethically dubious features.

Being able to fully understand and explain, at the right level of granularity, the results of a credit risk model will help the community grow more comfortable accepting the results of ML models.

The success metrics for this project are:

  • A reasonably accurate model under classification metrics (F1-score / AUC)
  • Correct implementation of MLI techniques on a subset of the model results
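Both of these metrics are available in scikit-learn; a minimal sketch on toy labels (the values below are illustrative, not project data):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy labels and model scores, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.3, 0.2, 0.8, 0.35, 0.4, 0.9, 0.2])

# F1 needs hard predictions, so threshold the scores at 0.5
y_pred = (y_score >= 0.5).astype(int)

f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)
print(f"F1: {f1:.3f}  AUC: {auc:.3f}")
```

Note that AUC is computed from the raw scores, while F1 depends on the chosen decision threshold.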

Analysis


The goal of the following section is for the reader to become familiar with the dataset we are working with, plus potential pitfalls in the data we might need to address.

Data Exploration

Exploration of a dataset is paramount to a correct framing of our problem. Our objective is to predict an event of serious delinquency in the next 2 years based on 10 variables. In order to really understand our data and the scope of the problem at hand we will focus on the following steps:

  • Dimensions of the dataset
  • Descriptive stats + Probability distribution plots
  • NaN Detection
  • Outlier Detection + Box plots
  • Correlation Matrix
  • Model Constraints from data related problems
  • Possible candidates of features

Dimensions and Descriptive stats

The Give Me Some Credit dataset is a collection of important features regarding the potential credit risk of a client. We have a total of 150,000 different clients available for our training exercise. The feature space for each client ranges from financial standing and financial status to demographics.

Descriptive statistics of the dataset are as follows:

In [4]:
data.describe()
Out[4]:
SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
count 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 1.202690e+05 150000.000000 150000.000000 150000.000000 150000.000000 146076.000000
mean 0.066840 6.048438 52.295207 0.421033 353.005076 6.670221e+03 8.452760 0.265973 1.018240 0.240387 0.757222
std 0.249746 249.755371 14.771866 4.192781 2037.818523 1.438467e+04 5.145951 4.169304 1.129771 4.155179 1.115086
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.029867 41.000000 0.000000 0.175074 3.400000e+03 5.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.154181 52.000000 0.000000 0.366508 5.400000e+03 8.000000 0.000000 1.000000 0.000000 0.000000
75% 0.000000 0.559046 63.000000 0.000000 0.868254 8.249000e+03 11.000000 0.000000 2.000000 0.000000 1.000000
max 1.000000 50708.000000 109.000000 98.000000 329664.000000 3.008750e+06 58.000000 98.000000 54.000000 98.000000 20.000000

The Give Me Some Credit competition support team kindly offered a data dictionary for this dataset, summarized here for the reader:

  • SeriousDlqin2yrs [int] : Person Experienced 90 days past due delinquency or worse
  • RevolvingUtilizationOfUnsecuredLines [float] : Total balance on credit cards and personal lines of credit, divided by the sum of credit limits
  • Age [int]: Age of the borrower in years
  • NumberOfTime30-59DaysPastDueNotWorse [int]: Number of times the borrower has been 30-59 days past due but not worse in the last 2 years
  • DebtRatio [float]: Monthly debt payments, alimony, living costs divided by monthly gross income
  • MonthlyIncome [float]: Monthly Income of customer
  • NumberOfOpenCreditLinesAndLoans [int]: Number of open loans (installments like car loan or mortgage) and lines of credit
  • NumberOfTimes90DaysLate [int]: Number of times borrower has been 90 days or more past due.
  • NumberRealEstateLoansOrLines [int]: Number of mortgage and real estate loans including home equity lines of credit
  • NumberOfTime60-89DaysPastDueNotWorse [int]: Number of times borrower has been 60-89 days past due but not worse in the last 2 years
  • NumberOfDependents [int]: Number of dependents in family excluding themselves.

Regarding theoretical limits for the variables: utilization cannot be negative (unless there is an overdraft); high debt-to-income ratios should not be treated as outliers, since the ratio approaches infinity as income approaches 0; negative incomes are data errors; and the demographic data may contain outliers.
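These limits can be turned into simple sanity-check flags. A sketch on a hypothetical three-row sample (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical rows illustrating the theoretical limits discussed above
sample = pd.DataFrame({
    "RevolvingUtilizationOfUnsecuredLines": [0.3, -0.1, 2.0],
    "MonthlyIncome": [5400.0, 0.0, -100.0],
})

neg_util = sample["RevolvingUtilizationOfUnsecuredLines"] < 0   # only plausible with overdraft
zero_income = sample["MonthlyIncome"] == 0                      # debt-to-income limit approaches infinity
neg_income = sample["MonthlyIncome"] < 0                        # data error

print(neg_util.sum(), zero_income.sum(), neg_income.sum())
```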

Data Visualization

We will start our visualizations always taking into account our target feature: loan default in the next 2 years. Since lending is part of a bank's livelihood, credit default rates should be consistently imbalanced.

A total of 6.6% of the 150k population has incurred serious default events within two years. Given the nature of the feature and the dataframe, I suspect this information was collected from a mortgage portfolio.

The distribution of the target feature is presented as follows:

In [241]:
pd.DataFrame(data['SeriousDlqin2yrs'].value_counts(normalize=True))
Out[241]:
SeriousDlqin2yrs
0 0.93316
1 0.06684
In [242]:
fig, ax = plt.subplots()
labels = ['Default', 'Non-Default']
values = data['SeriousDlqin2yrs'].value_counts(normalize=True).reindex([1, 0]).values
ax.bar(labels, values, color=['red', 'green'], alpha=0.7)
plt.title('Figure 1. Distribution of Target Feature: SeriousDlqin2yrs')
plt.show()

As part of portfolio theory, the goal of a risk manager is to select the optimal return for the associated risk on an investment. The bank provides loan applicants with the funds required for a specific purpose, and in return it collects the interest on the money lent. [0]

A correct measurement of risk allows the bank to maximize its returns at an expected level of delinquency on its loans. The goal of our model is to predict this suggested level of delinquency in the next two years as accurately as possible.
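To make the risk/return trade-off concrete, the standard expected-loss decomposition from credit risk theory can be sketched. All figures below are hypothetical except the default rate, which roughly matches the 6.6% reported above:

```python
# Expected loss = PD x LGD x EAD (standard credit risk decomposition)
pd_default = 0.0668   # probability of default, roughly the portfolio rate above
lgd = 0.45            # loss given default (hypothetical)
ead = 10_000.0        # exposure at default, per loan (hypothetical)

expected_loss = pd_default * lgd * ead
print(f"Expected loss per loan: {expected_loss:.2f}")
```

A better model of the probability of default feeds directly into a better estimate of this expected loss.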

In addition to the target feature analysis, we provide expected signs for our feature space as well as visualizations for each feature in Annex (A): CreditDefaultEDA.

The results of our EDA are summarized in the table below. All of these results stem from one main question: what worries me about this feature?

Feature Name | Expected Sign | Outlier Problems | Data Problems
SeriousDlqin2yrs | NA | NA | Unbalanced
RevolvingUtilizationOfUnsecuredLines | - | NA | Values > 1 suggest problems. Informative
Age | + | NA | Remove underage population
NumberOfTime30-59DaysPastDueNotWorse | - | NA | NaN treatment, encode values
DebtRatio | - | NA | NaN not associated with default, not reliable
MonthlyIncome | + | NA | NaN treatment, standardize
NumberOfOpenCreditLinesAndLoans | - | Cap at 10 | Standardize
NumberOfTimes90DaysLate | - | | Encode; relevant feature, possible leak
NumberRealEstateLoansOrLines | + | |
NumberOfTime60-89DaysPastDueNotWorse | - | | Encode
NumberOfDependents | - | |

The logic behind the expected signs is as follows. Users in financial distress tend to start using their available resources to cover the leak in their finances. High utilization at high rates suggests a troublesome profile: for example, a low level of liquidity, low financial sophistication (the highest borrowing rates in the market), or an unexpected financial shock, among others.

As Warren Buffett once said: "You cannot go through life borrowing at those rates." [1]

The logic behind age comes from dynamic macroeconomic theory and the financial life cycle. We expect (segregated by income) a Solow-type stock of income (savings), where the bulk of the individual's life expenditure diminishes as life progresses, and income grows until it becomes static at retirement. [2]

Debt-to-income ratios suggest how much of a burden the client's incurred loans are on their day-to-day life. Accurate income measurement is a problem in itself, but when a client reports 0 income to the bank, it is in the bank's best interest to offer the client restructuring facilities to weather the crisis. [3]

In practice, however, a low debt-to-income ratio has paradoxically been reported to be associated with a decrease in payment rates as well. Small amounts of debt seem to easily escape the to-do list of high-income segments of the population.

Regarding past-due events, our logic relies on Merton's modelling of financial assets. How close you have been to default should weigh on how likely you are to default in the future. Or: the standard deviation from expected returns should be priced proportionally to the expected return. A simple but elegant solution to asset pricing. [4]

This type of analysis is not possible on high-dimensional feature spaces, but from time to time, a low-dimensional feature space like this one allows some leniency in reasoning from theory about the expected results of our model.

Methodology


Data Preprocessing

As part of our preprocessing techniques we will:

  • Winsorize outliers according to bins defined by the IQR
  • Encode Missing Values

Null treatment

We identify clear NaN values and replace them with the median; the median is preferred for its robustness and because it is always a feasible value for the feature. The current state of missing values in the training set is displayed in the table below:

In [46]:
pd.DataFrame(data.isna().sum(), columns = ['Missing Values']) # Diagnose Dataframe
Out[46]:
Missing Values
SeriousDlqin2yrs 0
RevolvingUtilizationOfUnsecuredLines 0
age 0
NumberOfTime30-59DaysPastDueNotWorse 0
DebtRatio 0
MonthlyIncome 29731
NumberOfOpenCreditLinesAndLoans 0
NumberOfTimes90DaysLate 0
NumberRealEstateLoansOrLines 0
NumberOfTime60-89DaysPastDueNotWorse 0
NumberOfDependents 3924
RevolvingUtilizationOfUnsecuredLines_out 0
age_out 0
NumberOfTime30-59DaysPastDueNotWorse_out 0
DebtRatio_out 0
MonthlyIncome_out 0
NumberOfOpenCreditLinesAndLoans_out 0
NumberOfTimes90DaysLate_out 0
NumberOfTime60-89DaysPastDueNotWorse_out 0
In [47]:
# Create Boolean Features
data['MonthlyIncome_na'] = data['MonthlyIncome'].isna()
data['NumberOfDependents_na'] = data['NumberOfDependents'].isna()

# Impute median on problem features
data['MonthlyIncome'] = data['MonthlyIncome'].fillna(data['MonthlyIncome'].median())
data['NumberOfDependents'] = data['NumberOfDependents'].fillna(data['NumberOfDependents'].median())

# Check data
pd.DataFrame(data.isna().sum(), columns=['Missing Values'])
Out[47]:
Missing Values
SeriousDlqin2yrs 0
RevolvingUtilizationOfUnsecuredLines 0
age 0
NumberOfTime30-59DaysPastDueNotWorse 0
DebtRatio 0
MonthlyIncome 0
NumberOfOpenCreditLinesAndLoans 0
NumberOfTimes90DaysLate 0
NumberRealEstateLoansOrLines 0
NumberOfTime60-89DaysPastDueNotWorse 0
NumberOfDependents 0
RevolvingUtilizationOfUnsecuredLines_out 0
age_out 0
NumberOfTime30-59DaysPastDueNotWorse_out 0
DebtRatio_out 0
MonthlyIncome_out 0
NumberOfOpenCreditLinesAndLoans_out 0
NumberOfTimes90DaysLate_out 0
NumberOfTime60-89DaysPastDueNotWorse_out 0
MonthlyIncome_na 0
NumberOfDependents_na 0

Outlier treatment

As part of the outlier treatment we will clip outliers according to the interquartile range of the population. The bin edges used per feature are:

  • RevolvingUtilizationOfUnsecuredLines: [0, 0.029867442, 0.154180737, 0.5590462475, 1.35281445575, 2.146582664, float('inf')]
  • Age: [0, 18, 20, 41, 52, 63, 80, 93, float('inf')]
  • NumberOfTime30-59DaysPastDueNotWorse: [0, 0.99, 1, 2, 4, 6, 10, 13, float('inf')]
  • DebtRatio: [float('-inf'), -1.9044659907499997, -0.8646960792499998, 0.17507383225, 0.366507841, 0.86825377325, 1.9080236847499998, 2.9477935962499995, float('inf')]
  • MonthlyIncome: [float('-inf'), 0, 3400, 5400, 8000, 10000, 15000, 20000, 23000, float('inf')]
  • NumberOfOpenCreditLinesAndLoans: [float('-inf'), 0, 5, 8, 11, 20, 29, float('inf')]
  • NumberOfTimes90DaysLate: [float('-inf'), 0, 1, 2, 3, 8, 10, float('inf')]
  • NumberOfTime60-89DaysPastDueNotWorse: [float('-inf'), 0, 1, 2, 3, float('inf')]
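A sketch of how quartile-based edges like these can be derived from a numeric series (the helper name `iqr_bins` is ours, not part of the notebook, and the data below is synthetic):

```python
import numpy as np
import pandas as pd

def iqr_bins(series):
    """Bin edges from the quartiles and the 1.5*IQR whiskers of a series."""
    q1, q2, q3 = series.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [-np.inf, lower, q1, q2, q3, upper, np.inf]

# Illustrative use on synthetic data
s = pd.Series(np.random.default_rng(1).exponential(0.5, 1000))
edges = iqr_bins(s)
print(edges)
```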
In [43]:
# Encode Outliers
data['RevolvingUtilizationOfUnsecuredLines_out'] = np.where(data['RevolvingUtilizationOfUnsecuredLines']>=2.14, 1, 0)
data['age_out'] = np.where(data['age'] >= 90, 1, 0)
data['NumberOfTime30-59DaysPastDueNotWorse_out'] = np.where(data['NumberOfTime30-59DaysPastDueNotWorse'] > 13, 1, 0)
data['DebtRatio_out'] = np.where(data['DebtRatio'] > 2.94, 1, 0)
data['MonthlyIncome_out'] = np.where(data['MonthlyIncome'] > 23000, 1, 0)
data['NumberOfOpenCreditLinesAndLoans_out'] = np.where(data['NumberOfOpenCreditLinesAndLoans'] > 20, 1, 0)
data['NumberOfTimes90DaysLate_out'] = np.where(data['NumberOfTimes90DaysLate'] > 10, 1, 0)
data['NumberOfTime60-89DaysPastDueNotWorse_out'] = np.where(data['NumberOfTime60-89DaysPastDueNotWorse'] > 3, 1, 0)
In [44]:
data.head()
Out[44]:
SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents RevolvingUtilizationOfUnsecuredLines_out age_out NumberOfTime30-59DaysPastDueNotWorse_out DebtRatio_out MonthlyIncome_out NumberOfOpenCreditLinesAndLoans_out NumberOfTimes90DaysLate_out NumberOfTime60-89DaysPastDueNotWorse_out
0 1 0.766127 45 2 0.802982 9120.0 13 0 6 0 2.0 0 0 0 0 0 0 0 0
1 0 0.957151 40 0 0.121876 2600.0 4 0 0 0 1.0 0 0 0 0 0 0 0 0
2 0 0.658180 38 1 0.085113 3042.0 2 1 0 0 0.0 0 0 0 0 0 0 0 0
3 0 0.233810 30 0 0.036050 3300.0 5 0 0 0 0.0 0 0 0 0 0 0 0 0
4 0 0.907239 49 1 0.024926 63588.0 7 0 1 0 0.0 0 0 0 0 1 0 0 0

After preprocessing, the integrity of the information holds. No records were omitted during the cleaning process; we only adjusted outliers to maximum values inside each feature's distribution.

Implementation

Our first approach to model creation will be a random forest classifier with the standard parameters assigned by scikit-learn. A train/test split with 70% of the population for training is defined so that performance is measured on unseen data. The model should optimize classification metrics such as AUC or the components of the F1-score.

In [48]:
X = data[['RevolvingUtilizationOfUnsecuredLines',
          'age',
          'NumberOfTime30-59DaysPastDueNotWorse',
          'DebtRatio',
          'MonthlyIncome',
          'NumberOfOpenCreditLinesAndLoans',
          'NumberOfTimes90DaysLate',
          'NumberOfTime60-89DaysPastDueNotWorse',
          'NumberOfDependents']]
y = data['SeriousDlqin2yrs']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=42)
In [49]:
#--- memory consumed by train dataframe ---
mem = X_train.memory_usage(index=True).sum()
print("Memory consumed by training set  :   {} MB" .format(mem/ 1024**2))
print('\n') 
#--- memory consumed by test dataframe ---
mem = X_test.memory_usage(index=True).sum()
print("Memory consumed by test set      :   {} MB" .format(mem/ 1024**2))
Memory consumed by training set  :   8.0108642578125 MB


Memory consumed by test set      :   3.4332275390625 MB

Before training our models, a quick memory sanity check is in order. The training split consumes 8 MB and the test split 3.43 MB. Given the complexity of the models to be used, no memory shortage related problems should arise.

Now, regarding the model, we will first train scikit-learn's standard-parameter model. More information is available at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

  • RandomForestClassifier(random_state=42)
    • bootstrap: True
    • ccp_alpha: 0.0
    • class_weight: None
    • criterion: 'gini'
    • max_depth: None
    • max_features: 'auto'
    • max_leaf_nodes: None
    • max_samples: None
    • min_impurity_decrease: 0.0
    • min_impurity_split: None
    • min_samples_leaf: 1
    • min_samples_split: 2
    • min_weight_fraction_leaf: 0.0
    • n_estimators: 100
    • n_jobs: None
    • oob_score: False
    • random_state: 42
    • verbose: 0
    • warm_start: False

A pipeline object is defined in case further preprocessing steps arise in the future. The reported AUC score for this model is 0.8405; the classification report is displayed below.

In [57]:
pipeline = Pipeline([
    ('classifier', RandomForestClassifier(random_state = 42))
])
pipeline.fit(X_train, y_train)
Out[57]:
Pipeline(steps=[('classifier', RandomForestClassifier(random_state=42))])
In [74]:
preds = pipeline.predict(X_test)
np.mean(preds == y_test)
accuracy_score(preds, y_test)
print(classification_report(y_test, preds))
              precision    recall  f1-score   support

           0       0.94      0.99      0.97     42020
           1       0.53      0.19      0.28      2980

    accuracy                           0.94     45000
   macro avg       0.74      0.59      0.62     45000
weighted avg       0.92      0.94      0.92     45000

In [81]:
print('AUC Score of: ', roc_auc_score(y_test, pipeline.predict_proba(X_test)[:,1]))
AUC Score of:  0.8404912689387286

Refinement

We implement grid search cross-validation as provided by sklearn.model_selection. The chosen grid of hyperparameters is defined as follows:

  • n_estimators: [20, 30, 50]
  • max_depth: [2, 4]
  • min_samples_leaf: [2, 4]

The best parameter combination after validation is a max_depth of 4, n_estimators of 20 and min_samples_leaf of 4.

Even a small change to the standard hyperparameters helps the classification metrics, as can be seen in the ROC plot below.

In [63]:
from sklearn.model_selection import GridSearchCV

hyperparameters = {  'classifier__n_estimators': [20, 30, 50],
                    'classifier__max_depth': [2, 4],
                    'classifier__min_samples_leaf': [2, 4]
                  }
clf = GridSearchCV(pipeline, hyperparameters, cv = 3)
 
# Fit and tune model
clf.fit(X_train, y_train)
Out[63]:
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('classifier',
                                        RandomForestClassifier(random_state=42))]),
             param_grid={'classifier__max_depth': [2, 4],
                         'classifier__min_samples_leaf': [2, 4],
                         'classifier__n_estimators': [20, 30, 50]})
In [55]:
preds = clf.predict(X_test)
np.mean(preds == y_test)
Out[55]:
0.9365111111111111
In [72]:
rfc_no_cv = plot_roc_curve(pipeline, X_test, y_test)
rfc_cv    = plot_roc_curve(clf, X_test, y_test, ax=rfc_no_cv.ax_)
rfc_cv.figure_.suptitle("ROC curve comparison")
Out[72]:
Text(0.5, 0.98, 'ROC curve comparison')

The refined model displays a better AUC.

In [84]:
print(classification_report(y_test, preds))
              precision    recall  f1-score   support

           0       0.94      0.99      0.97     42020
           1       0.53      0.19      0.28      2980

    accuracy                           0.94     45000
   macro avg       0.74      0.59      0.62     45000
weighted avg       0.92      0.94      0.92     45000

Results


Model Evaluation and Validation

The model parameters are not the main objective of this article; nonetheless, the model reaches an accuracy of 0.94 on the test set, with an area under the receiver operating characteristic curve of 0.85.

How much the relatively low F1 score matters depends entirely on the use case. If this model is used directly for accepting/rejecting loans, extreme care should be taken not to exclude a large percentage of the candidate population.
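One way to manage that trade-off, not part of the original pipeline but a sketch of the idea, is to pick the decision threshold from the precision-recall curve rather than defaulting to 0.5. The example below uses synthetic imbalanced data as a stand-in, and the 0.5 precision floor is an invented policy parameter:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the credit set (93% non-default)
X, y = make_classification(n_samples=3000, weights=[0.93], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, scores)
# Choose the lowest threshold that still keeps precision above a policy floor
floor = 0.5
ok = precision[:-1] >= floor
chosen = float(thresholds[ok][0]) if ok.any() else 0.5
flagged_share = (scores >= chosen).mean()   # share of applicants flagged as risky
print(chosen, flagged_share)
```

Tracking the flagged share alongside precision makes the exclusion risk explicit instead of implicit in a fixed 0.5 cut-off.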

A common approach to explaining the results of a non-linear model relies on plotting the variable importance for the whole population. We will follow this approach but also expand on its caveats.

In [95]:
pipeline.steps[0][1].feature_importances_
Out[95]:
array([0.20006891, 0.12987678, 0.04971695, 0.19124198, 0.15444512,
       0.09044784, 0.09123611, 0.05008502, 0.04288129])
In [104]:
clf.best_estimator_.steps[0][1].feature_importances_
Out[104]:
array([0.18629317, 0.01998433, 0.14254297, 0.00563108, 0.00333057,
       0.00698334, 0.36004074, 0.27470008, 0.00049371])
In [109]:
feature_imp = pd.Series(clf.best_estimator_.steps[0][1].feature_importances_,index=X.columns).sort_values(ascending=False)


# Creating a bar plot

sns.barplot(x=feature_imp, y=feature_imp.index, color="salmon")
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

Although variable importance gives us an important insight, regulators require granularity regarding individual client offers/refusals. In order to comply with the strictest regulations we need to be able to provide reason codes for each client refused by the bank. Let's take the top 10 worst predicted probabilities on our test set.

In [212]:
print(pd.DataFrame(clf.predict_proba(X_test)).head(5))
type(X_test)
          0         1
0  0.980796  0.019204
1  0.979375  0.020625
2  0.974721  0.025279
3  0.912757  0.087243
4  0.914019  0.085981
Out[212]:
pandas.core.frame.DataFrame
In [215]:
X_test_prob_df = pd.DataFrame(clf.predict_proba(X_test))
X_test_df = X_test.reset_index()                               
In [216]:
X_test_merge = X_test_df.merge(X_test_prob_df, left_index=True, right_index = True)
In [237]:
top10_worst_clients = X_test_merge.sort_values(by=1, ascending = False).head(10)
top10_worst_clients
Out[237]:
index RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents 0 1
41750 69189 1.224775 37 4 1921.000000 5400.0 9 2 3 0.0 0.353131 0.646869
42136 59128 1.636430 41 3 0.651415 3780.0 6 3 3 3.0 0.361929 0.638071
42231 62640 1.279819 61 4 921.000000 5400.0 11 3 1 0.0 0.364341 0.635659
30796 113042 1.260924 47 2 3512.000000 5400.0 5 3 1 0.0 0.365426 0.634574
43287 42081 1.173509 33 3 0.430443 5570.0 13 5 2 2.0 0.365590 0.634410
5392 43443 1.255814 42 3 0.477626 3083.0 6 3 2 3.0 0.372490 0.627510
24984 4081 1.201452 57 2 1906.000000 5400.0 9 4 1 0.0 0.373919 0.626081
19060 140025 1.161119 52 4 0.581637 2308.0 7 2 1 3.0 0.375208 0.624792
451 18633 1.671105 52 3 0.513251 4527.0 8 3 1 4.0 0.375208 0.624792
13749 118412 1.074221 47 2 3.354767 450.0 8 2 1 2.0 0.375578 0.624422
In [122]:
import lime
import lime.lime_tabular
In [133]:
predict_fn_rf = lambda x: clf.predict_proba(x).astype(float)
In [164]:
X = X_train.values
explainer = lime.lime_tabular.LimeTabularExplainer(X,feature_names = X_train.columns,class_names=[0,1],kernel_width=5)
In [225]:
y_test.loc[[69189]]
Out[225]:
69189    1
Name: SeriousDlqin2yrs, dtype: int64
In [228]:
choosen_instance = X_test.loc[[69189]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_fn_rf,num_features=10)
In [231]:
exp.show_in_notebook(show_all=False)
In [232]:
choosen_instance = X_test.loc[[59128]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_fn_rf,num_features=10)
exp.show_in_notebook(show_all=False)
In [238]:
choosen_instance = X_test.loc[[62640]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_fn_rf,num_features=10)
exp.show_in_notebook(show_all=False)
In [239]:
choosen_instance = X_test.loc[[42081]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_fn_rf,num_features=10)
exp.show_in_notebook(show_all=False)

Justification

As has been observed, variable importance on classifiers is simply not enough for the level of granularity required to convince regulators and managers of the benefits of using ML techniques in credit risk.

This has been a small example showing that an accurate, fully explained model is possible under current economic/finance theory. By explaining the reasoning of the coded model client by client, the benefits and potential insights across the whole studied population are immense.

The ability of non-linear approaches to find clusters of explainable phenomena across a bank's portfolio is exactly what is needed now, in times of distress and global economic recovery.

Conclusion


Reflection

To recapitulate, in this project we:

  • Used relevant Credit Risk data from a global competition
  • Explored, and processed the data
  • Trained a Random Forest classifier and then iterated to find the best parameters
  • Displayed Variable Importance for the results
  • Implemented a Machine Learning Interpretability technique to fully explain the effect of each feature on the predicted probability using LIME

The last part proved the most difficult. LIME is still in its early releases: it is not yet fully scalable, and the GUI elements of the explainer are still somewhat inflexible, but the power of explaining at the record level is indeed impressive.

Improvement

There are some caveats to linearizing the probability space across the joint distributions: the explanations are only as good as the model linearization. Further research on the algorithm is still needed to fully scale the results to the tens of thousands of applications any small bank receives per day.

H2O's solution to these problems was the development of K-LIME, which displays the R^2 of each cluster's linearization. Another (also not yet scalable) MLI technique is the use of Shapley values, a game-theoretic approach to finding the contribution of each feature to the predicted probability.
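The Shapley idea can be shown exactly on a toy two-feature cooperative game. This is not part of the implementation above; the feature names and payoff numbers are hypothetical, and real SHAP libraries approximate this computation efficiently for trees:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values for a small cooperative game.
    `value` maps a frozenset of players to the coalition's payoff."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            for coalition in combinations(others, r):
                s = frozenset(coalition)
                # Weight of this coalition size in the Shapley average
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

# Toy "model": added predicted probability from two features (hypothetical payoffs)
payoffs = {frozenset(): 0.0, frozenset({"util"}): 0.4,
           frozenset({"late"}): 0.3, frozenset({"util", "late"}): 0.6}
phi = shapley_values(["util", "late"], payoffs.__getitem__)
print(phi)
```

The attributions sum to the payoff of the full coalition, which is exactly the property that makes Shapley values attractive as per-client reason codes.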

Further improvements to this implementation could target the preprocessing pipeline: feature engineering (as long as it doesn't break interpretability) and a more extensive CV search across all plausible best parameters for the classification techniques.